MapReduce and PACT - Comparing Data Parallel Programming Models
Abstract
Web-scale analytical processing is a heavily investigated topic in current research. Next to parallel databases, new flavors of parallel data processors have recently emerged. One of the most discussed approaches is MapReduce. MapReduce is distinguished by its programming model: all programs expressed in terms of the second-order functions map and reduce can be automatically parallelized. Although MapReduce provides a valuable abstraction for parallel programming, it clearly has some deficiencies. These become obvious when considering the tricks one has to play to express more complex tasks in MapReduce, such as operations with multiple inputs. The Nephele/PACT system uses a programming model that pushes the idea of MapReduce further. It is centered around so-called Parallelization Contracts (PACTs), which are in many cases better suited to expressing complex operations than plain MapReduce. By virtue of that programming model, the system can also apply a series of optimizations to the data flows before they are executed by the Nephele runtime system. This paper compares the PACT programming model with MapReduce from the perspective of the programmer who specifies analytical data processing tasks. We discuss the implementation of several typical analytical operations both with MapReduce and with PACTs, highlighting the key differences between the two programming models.
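The map/reduce programming model referred to above can be sketched in a few lines of plain Python, assuming the classic word-count task as the workload. This is a minimal illustrative sketch, not code from MapReduce, Hadoop, or Nephele/PACT; all function names here are hypothetical:

```python
from collections import defaultdict

def map_fn(line):
    # Map phase: emit one (word, 1) pair per word in the input record.
    return [(word, 1) for word in line.split()]

def reduce_fn(key, values):
    # Reduce phase: aggregate all values that share the same key.
    return (key, sum(values))

def run_mapreduce(records, map_fn, reduce_fn):
    # Shuffle: group intermediate (key, value) pairs by key.
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    # Each key's group is reduced independently, which is what lets
    # a real system run the reduce calls in parallel across machines.
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

counts = run_mapreduce(["to be or not to be"], map_fn, reduce_fn)
# counts == {"to": 2, "be": 2, "or": 1, "not": 1}
```

The sketch also hints at the abstract's criticism: because `run_mapreduce` takes exactly one input collection, a task with two inputs (e.g. a join) cannot be expressed directly and must be encoded through workarounds such as tagging records by origin.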
Similar Articles
Cloud Computing Technology Algorithms Capabilities in Managing and Processing Big Data in Business Organizations: MapReduce, Hadoop, Parallel Programming
The objective of this study is to verify the importance of the capabilities of cloud computing services in managing and analyzing big data in business organizations because the rapid development in the use of information technology in general and network technology in particular, has led to the trend of many organizations to make their applications available for use via electronic platforms hos...
PonIC: Using Stratosphere to Speed Up Pig Analytics
Pig, a high-level dataflow system built on top of Hadoop MapReduce, has greatly facilitated the implementation of data-intensive applications. Pig successfully manages to conceal Hadoop’s one input and two-stage inflexible pipeline limitations, by translating scripts into MapReduce jobs. However, these limitations are still present in the backend, often resulting in inefficient execution. Strat...
A Review of CUDA, MapReduce, and Pthreads Parallel Computing Models
The advent of high performance computing (HPC) and graphics processing units (GPUs) presents an enormous computation resource for large data transactions (big data) that require parallel processing for robust and prompt data analysis. While a number of HPC frameworks have been proposed, parallel programming models present a number of challenges – for instance, how to fully utilize features in th...
Comparing Data Processing Frameworks for Scalable Clustering
Recent advances in the development of data parallel platforms have provided a significant thrust to the development of scalable data mining algorithms for analyzing the massive data sets that are now commonplace. Scalable clustering is a common data mining task that finds consequential applications in astronomy, biology, social network analysis, and commercial domains. The variety of platforms avail...
TransMR: Data-Centric Programming Beyond Data Parallelism
MapReduce and related data-centric programming models have proven to be effective for a variety of large-scale distributed computations, in particular, those that manifest data parallelism. The fault-tolerance model underlying these programming environments relies on deterministic replay, which makes data-sharing (side-effects) across computations harder to support. This significantly limits th...